Abstract

Comparative analysis of the genome sequences of Solanum lycopersicum variety Heinz 1706 and S. pimpinellifolium accession LA1589 using MUGSY software identified 145 695 insertion-deletion (InDel) polymorphisms. A selected set of 3029 candidate InDels (≥2 bp) across the entire tomato genome was subjected to PCR validation, and 82.4% could be verified. Of the 2272 InDels polymorphic between LA1589 and Heinz 1706, 61.6, 45.2, and 31.6% were polymorphic in 8 accessions of S. pimpinellifolium, 4 accessions of S. lycopersicum var. cerasiforme, and 10 varieties of S. lycopersicum, respectively. The average genetic distance was 0.216 in S. pimpinellifolium, 0.202 in S. lycopersicum var. cerasiforme, and 0.108 in S. lycopersicum. The data suggest a reduction of genetic variation from S. pimpinellifolium to S. lycopersicum var. cerasiforme and S. lycopersicum. Cluster analysis showed that the 8 accessions of S. pimpinellifolium formed one group, whereas the 4 accessions of S. lycopersicum var. cerasiforme and the 10 varieties of S. lycopersicum fell together in a second group.
1. Introduction

Tomato (Solanum lycopersicum L.) is an economically important vegetable crop worldwide and a pre-eminent system for plant genetic analysis. Genetic markers for tomato have been developed for over 30 years through various approaches, including restriction fragment length polymorphism (RFLP), random amplified polymorphic DNA (RAPD), amplified fragment length polymorphisms (AFLPs), simple sequence repeat (SSR), cleaved amplified polymorphisms (CAPs), and conserved ortholog sets (COSs). Most markers developed by these approaches are based on DNA or cDNA polymorphisms between wild species and cultivated tomato, which led to the construction of the first-generation reference linkage maps and the isolation of genes of interest. 1,2 However, the ability of these markers to detect polymorphisms within cultivated tomato is limited. 3 Recent efforts to develop new markers in cultivated tomato have focused on single-nucleotide polymorphisms (SNPs) using in silico mining of expressed sequence tag databases with experimental validation, 4-7 amplicon sequencing of COS genes, 8,9 hybridization to oligonucleotide arrays, 10 and next-generation sequencing of transcriptomes or re-sequencing of genomes. 11-13 Owing to the abundance and wide distribution of SNPs across the whole genome and the availability of automated large-scale genotyping platforms, SNPs have been widely used in association analysis, 13-15 high-density SNP map construction, 7,16 and population structure and genetic variation analysis 17-20 in cultivated tomato.

Short insertion and deletion (InDel) polymorphisms are receiving increasing attention in humans because they are the second most abundant form of genetic variation and can influence multiple human phenotypes, including diseases. 21-25 Therefore, great effort has been put into the identification, mapping, and functional analysis of InDels in the human genome. 26-28 Similar work has been done in other species, such as Arabidopsis and rice. 29-33 In tomato, a total of 749 966 putative InDels of 3-300 bp have been identified by comparing the genome sequences of Solanum pimpinellifolium accession LA1589 and S. lycopersicum variety Heinz 1706, 34 and more than 80 000 putative InDels of 1-15 bp have been discovered by comparative transcriptome analysis between the wild species S. galapagense and cultivated tomato. 35 However, less work has been done on the discovery of InDels within cultivated tomato.

The availability of whole-genome or transcriptome sequences makes it possible to identify InDels in silico. Here we developed a pipeline to identify InDels by comparative analysis of the two available genome sequences of LA1589 and Heinz 1706. A total of 3029 candidate InDels were subjected to experimental validation by PCR amplification of genomic DNA in a collection of 22 tomato lines. The main objective of this study was to develop easy-to-use markers for genetic studies and marker-assisted selection in cultivated tomato.

2. Materials and methods 

2.1. Plant materials and DNA isolation

A panel of 22 tomato genotypes comprising cultivated tomato (S. lycopersicum) and its wild relatives was used to validate InDel polymorphisms. These inbred lines were selected to represent a diverse collection including eight accessions of S. pimpinellifolium, five processing varieties, one greenhouse cultivar, four fresh market cultivars, and four S. lycopersicum var. cerasiforme accessions (Table 1). Nine of them were used for SNP detection in our previous study. 9 The eight S. pimpinellifolium accessions were selected from the core collection or from sources used in genetic studies, and were used to detect polymorphisms of candidate InDels within the species. Genomic DNA was isolated from freshly collected young leaves of at least eight plants per genotype using a modified CTAB method. 36

2.2. Prediction of InDels between LA1589 and Heinz 1706

The genomic DNA sequences of S. pimpinellifolium accession LA1589 (Spimpinellifolium_genome.contigs.fasta.gz) and S. lycopersicum variety Heinz 1706 (S_lycopersicum_chromosomes.2.40.fa.gz) were downloaded to a local computer from the SOL Genomics Network (SGN, http://solgenomics.net/, 19 February 2014, date last accessed). The genomic sequence contigs of LA1589 were aligned to the Heinz 1706 genome using a local installation of MUGSY 37 downloaded from Sourceforge (http://mugsy.sourceforge.net/, 19 February 2014, date last accessed). InDel polymorphisms relative to Heinz 1706 were mined from the alignments using custom PERL scripts. Flanking sequences of 100 bp on each side of each candidate InDel were extracted from the Heinz 1706 sequences for insertions and from the LA1589 sequences for deletions. The flanking sequences were then searched against the LA1589 sequences for deletions or the Heinz 1706 sequences for insertions using local BLASTall with an E-value cut-off of 1e-20 to remove hits with low similarity. The type (insertion or deletion), length, nucleotide sequence, and chromosomal position of each InDel were extracted with a PERL script from the highest-scoring BLAST hit.
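The authors' PERL scripts are not shown, so the following is only a minimal sketch of how gap runs in a pairwise alignment can be turned into InDel calls relative to a reference. The function name `call_indels`, the input format (two equal-length gapped strings), and the position convention are assumptions made for illustration; they are not taken from the paper or from MUGSY's output format.

```python
def call_indels(ref_aln, qry_aln, min_len=1):
    """List (ref_pos, kind, seq) for each gap run in a pairwise alignment.

    ref_aln and qry_aln are equal-length aligned strings using '-' for gaps
    (hypothetical input format). 'insertion' means the reference carries
    bases the query lacks; 'deletion' means the reference lacks bases the
    query carries. Positions are 1-based reference coordinates.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    indels = []
    ref_pos = 0  # number of reference bases consumed so far
    i, n = 0, len(ref_aln)
    while i < n:
        r, q = ref_aln[i], qry_aln[i]
        if q == "-" and r != "-":
            # Gap in the query: extra bases in the reference.
            j = i
            while j < n and qry_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            seq = ref_aln[i:j]
            if len(seq) >= min_len:
                indels.append((ref_pos + 1, "insertion", seq))
            ref_pos += j - i
            i = j
        elif r == "-" and q != "-":
            # Gap in the reference: bases missing from the reference.
            j = i
            while j < n and ref_aln[j] == "-" and qry_aln[j] != "-":
                j += 1
            seq = qry_aln[i:j]
            if len(seq) >= min_len:
                indels.append((ref_pos, "deletion", seq))
            i = j
        else:
            if r != "-":
                ref_pos += 1
            i += 1
    return indels


calls = call_indels("ACGTTTGA-CA", "ACG--TGATCA")
```

Here an insertion is reported at the first extra reference base, and a deletion at the last reference base before the gap. A real pipeline would additionally need the BLAST-based filtering of flanking sequences described above.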

2.3. Selection of InDels for validation and primer design

Our initial goal was to verify 3000 candidate InDels of 2 bp or longer evenly distributed over the 12 chromosomes. Based on the sequenced genome size of Heinz 1706 (760 Mb), 34 the average distance between two adjacent InDels would be ~250 kb. The number of InDels to be validated on each chromosome was determined by the length of that chromosome (Table 2). However, we found that InDels were not always evenly distributed along chromosomes, and hotspots had higher densities of InDels than other regions. Therefore, we tried to select one InDel per 200 kb on each chromosome using a PERL script. If a 200-kb region on a chromosome contained no InDel, the script extended the region in 100-kb increments until an InDel was found.
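The selection rule described above can be sketched as follows. This is one possible reading of the procedure rather than the authors' script: the function name `pick_spaced_indels`, the choice of the first InDel inside each window, and restarting the next window at the end of the previous one are all assumptions.

```python
import bisect


def pick_spaced_indels(positions, chrom_len, window=200_000, step=100_000):
    """Pick at most one InDel per window along one chromosome.

    positions: sorted InDel coordinates in bp (hypothetical input).
    An empty window is widened by `step` until it contains an InDel
    or reaches the end of the chromosome.
    """
    picked = []
    start = 0
    while start < chrom_len:
        end = start + window
        lo = bisect.bisect_left(positions, start)
        hi = bisect.bisect_left(positions, end)
        # Widen an empty window in `step` increments.
        while hi == lo and end < chrom_len:
            end += step
            hi = bisect.bisect_left(positions, end)
        if hi > lo:
            picked.append(positions[lo])  # assumption: take the first InDel
        start = end
    return picked


chosen = pick_spaced_indels([50_000, 120_000, 650_000, 1_050_000], 1_200_000)
```

In this toy example the second window is empty at 200-400 kb and is widened until it reaches the InDel at 650 kb, which mirrors the "200 plus 100 kb" extension described in the text.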

To design primers for PCR validation of InDels, flanking sequences of 100 bp on each side of each candidate InDel were extracted. Primers were designed using a local installation of Primer3 38 downloaded from Sourceforge (http://sourceforge.net/project/showfiles.php?group_id=112461, 19 February 2014, date last accessed), with a PCR product length of 100-200 bp and an optimal primer length of 20 bp. Several primer pairs were designed for each InDel. The best primer pair was selected based on an optimal GC content of 40-60% and a difference in GC content between the forward and reverse primers of <10%. The whole process was carried out using custom PERL scripts. Primers were synthesized at Sunbiotech Company (Beijing, China) or Sangong Company (Beijing, China).
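The GC-content criteria above can be expressed as a small filter. This sketch assumes candidate primer pairs are already available as sequence strings (for example from Primer3 output); the function names are hypothetical, and returning the first pair that passes is an assumption, since the text does not say how ties among passing pairs were broken.

```python
def gc_percent(seq):
    """GC content of a primer sequence as a percentage."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def pick_primer_pair(pairs, gc_min=40.0, gc_max=60.0, max_gc_diff=10.0):
    """Return the first (forward, reverse) pair meeting the GC criteria.

    Both primers must have 40-60% GC and differ by less than 10
    percentage points; returns None when no pair qualifies.
    """
    for fwd, rev in pairs:
        gc_f, gc_r = gc_percent(fwd), gc_percent(rev)
        in_range = gc_min <= gc_f <= gc_max and gc_min <= gc_r <= gc_max
        if in_range and abs(gc_f - gc_r) < max_gc_diff:
            return fwd, rev
    return None


# Made-up candidate sequences, for illustration only.
candidates = [
    ("ATATATATATATATATATAT", "GCGCGCGCGCGCGCGCGCGC"),
    ("ACGTACGTACGTACGTACGT", "ACGGTCAGTCAATGCAGTCA"),
]
best = pick_primer_pair(candidates)
```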

2.4. Validation of InDels using PCR 

PCR was used to validate the candidate InDels. All synthesized primers were first used to amplify genomic DNA of tomato lines LA1589 and Heinz 1706. Only primer pairs that successfully amplified a product and showed a length polymorphism were then used to detect polymorphisms in the 22 tomato genotypes.
Table 1. Description of plant materials

| Genotype | Species | Market type | Origin | Note |
|---|---|---|---|---|
| LA1269 | Solanum pimpinellifolium | Wild | Peru | Resistance source for late blight (Ph-3) |
| LA1589 | Solanum pimpinellifolium | Wild | Peru | Genome sequenced, widely used for genetic studies |
| PI 128216 | Solanum pimpinellifolium | Wild | Bolivia | Resistance source for bacterial spot and bacterial speck |
| LA0373 | Solanum pimpinellifolium | Wild | Peru | Core collection |
| LA0400 | Solanum pimpinellifolium | Wild | Peru | Core collection |
| LA0722 | Solanum pimpinellifolium | Wild | Peru | Core collection |
| LA1582 | Solanum pimpinellifolium | Wild | Peru | Core collection |
| LA2181 | Solanum pimpinellifolium | Wild | Peru | Core collection |
| Heinz 1706 | Solanum lycopersicum | Processing | USA | Genome sequenced |
| OH 88119 | Solanum lycopersicum | Processing | USA | Early fruit set |
| OH 9242 | Solanum lycopersicum | Processing | USA | High lycopene |
| Liger 87-5 | Solanum lycopersicum | Processing | China | Current major variety in China |
| M82 | Solanum lycopersicum | Processing | Israel | Widely used in genetic studies |
| Moneymaker | Solanum lycopersicum | Greenhouse | USA | Widely used in genetic studies |
| Fla.7600 | Solanum lycopersicum | Fresh market | USA | Variety with multiple disease resistance genes |
| Baiguoqiangfeng | Solanum lycopersicum | Fresh market | China | Previous major variety in China |
| Shijifeng | Solanum lycopersicum | Fresh market | China | Previous major variety in China |
| Zhongshu 5 | Solanum lycopersicum | Fresh market | China | Previous major variety in China |
| Black cherry | Solanum lycopersicum var. cerasiforme | Cherry | USA | Brown fruit |
| LA1310 | Solanum lycopersicum var. cerasiforme | Cherry | Peru | Salt tolerance |
| LA4133 | Solanum lycopersicum var. cerasiforme | Cherry | USA | Core collection, salt tolerance |
| PI 114490 | Solanum lycopersicum var. cerasiforme | Cherry | UK | Yellow fruit, resistance to bacterial spot |



Table 2. Summary statistics for primer design, PCR amplification, and polymorphisms

The last three columns give the number (percentage) of polymorphic InDels within each species.

| Chromosome | Sequence length (~Mb) a | No. of primers designed | No. of primers without PCR amplification | No. of primers without polymorphism | No. of primers examined | S. pimpinellifolium | S. lycopersicum var. cerasiforme | S. lycopersicum |
|---|---|---|---|---|---|---|---|---|
| chr01 | 90.3 | 362 | 63 | 89 | 210 | 132 (62.9) | 38 (18.1) | 22 (10.5) |
| chr02 | 49.9 | 207 | 10 | 22 | 175 | 98 (56.0) | 134 (76.6) | 75 (42.9) |
| chr03 | 64.8 | 254 | 19 | 31 | 204 | 123 (60.3) | 128 (62.7) | 32 (15.7) |
| chr04 | 64.1 | 254 | 12 | 24 | 218 | 120 (55.0) | 128 (58.7) | 144 (66.1) |
| chr05 | 65.0 | 262 | 33 | 53 | 176 | 112 (63.6) | 125 (71.0) | 127 (72.2) |
| chr06 | 46.0 | 181 | 20 | 40 | 121 | 99 (81.8) | 83 (68.6) | 73 (60.3) |
| chr07 | 65.3 | 259 | 39 | 15 | 205 | 94 (45.9) | 26 (12.7) | 17 (8.3) |
| chr08 | 63.0 | 252 | 23 | 21 | 208 | 160 (76.9) | 12 (5.8) | 11 (5.3) |
| chr09 | 67.7 | 267 | 21 | 34 | 212 | 135 (63.7) | 80 (37.7) | 66 (31.1) |
| chr10 | 64.8 | 255 | 14 | 35 | 206 | 131 (63.6) | 111 (53.9) | 11 (5.3) |
| chr11 | 53.4 | 214 | 8 | 35 | 171 | 80 (46.8) | 123 (71.9) | 109 (63.7) |
| chr12 | 65.5 | 262 | 10 | 86 | 166 | 116 (69.9) | 40 (24.1) | 30 (18.1) |
| Total | 759.8 | 3029 | 272 | 485 | 2272 | 1400 (61.6) | 1028 (45.2) | 717 (31.6) |

a The sequenced genome size was obtained from Sato et al. 34



All PCRs were done in a 10-µl reaction volume using the method described in Wei et al. 39 Reactions were heated at 95°C for 5 min, followed by 32 cycles of 30 s at 95°C, 30 s at 50-60°C depending on the Tm values of the primer pairs, and 30 s at 72°C, with a final extension of 5 min at 72°C. The PCR products were subsequently separated in 8% polyacrylamide gels and visualized using the silver-staining approach. 17
2.5. Data collection and analysis

The presence or absence of each allele for each InDel was coded as 1 or 0, respectively, and scored in a binary data matrix. The allele frequency of each InDel marker was calculated for each genotype. Nei's genetic distance 40 was calculated for each pair of tomato genotypes using the software package PHYLIP 3.695 (http://evolution.genetics.washington.edu/phylip.html, 19 February 2014, date last accessed). An unweighted pair group method with arithmetic mean (UPGMA) cluster analysis was performed to construct a dendrogram.
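As an illustration of the distance step, the sketch below computes Nei's standard genetic distance, D = -ln(Jxy / sqrt(Jx Jy)), from per-locus allele frequencies. Whether this is exactly the variant and data layout used in the PHYLIP run is an assumption; the function name and the dict-per-locus representation are hypothetical, and the sketch is not expected to reproduce the published distances.

```python
import math


def nei_distance(x, y):
    """Nei's standard genetic distance between two frequency profiles.

    x and y are lists with one dict per locus, mapping allele label to
    allele frequency (assumed layout). For an inbred line scored by
    presence/absence, the allele that is present has frequency 1.0.
    """
    jx = jy = jxy = 0.0
    for fx, fy in zip(x, y):
        jx += sum(p * p for p in fx.values())
        jy += sum(q * q for q in fy.values())
        jxy += sum(p * fy.get(allele, 0.0) for allele, p in fx.items())
    if jxy == 0.0:
        return math.inf  # no shared alleles at any locus
    return -math.log(jxy / math.sqrt(jx * jy))


# Two hypothetical inbred lines that differ at one of four loci.
line_a = [{"a": 1.0}, {"a": 1.0}, {"a": 1.0}, {"a": 1.0}]
line_b = [{"a": 1.0}, {"a": 1.0}, {"a": 1.0}, {"b": 1.0}]
d_ab = nei_distance(line_a, line_b)
```

For two fixed lines the identity term reduces to the fraction of loci carrying the same allele, so the toy pair above gives -ln(0.75). The resulting pairwise matrix is what a UPGMA routine would take as input.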

The occurrence of InDels in coding regions of genes was examined by blasting the 100-bp flanking sequences on each side of each InDel against the tomato ITAG2.3_cds.fasta file downloaded from SGN, using a PERL script.

3. Results

3.1. Candidate InDels between LA1589 and Heinz 1706

A total of 145 695 candidate InDels were identified between the genome sequences of Heinz 1706 and LA1589, of which 65 619 were insertions and 80 076 were deletions in Heinz 1706 (Table 3). The average size of predicted InDels was 4.1 bp with a range of 1-94 bp; ~54.0% were 1 bp, 42.3% were 2-20 bp, and 3.7% were longer than 20 bp. The average density of InDels was one per 5.22 kb, ranging from one per 4.33 kb to one per 6.72 kb across the 12 chromosomes. The highest density was on chromosome 6 and the lowest on chromosome 12 (Table 3). The smallest difference between the numbers of 1-bp and >1-bp InDels was observed on chromosome 2 (101), and the largest on chromosome 10 (1496).
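The density figures can be rechecked by dividing chromosome length by the number of predicted InDels. The sketch below uses the chromosome lengths from Table 2 and the counts from Table 3 for three chromosomes and for the whole genome; pairing those two tables in this way is an assumption about how the published frequencies were derived.

```python
# Chromosome lengths (Mb, Table 2) and predicted InDel counts (Table 3).
LENGTH_MB = {"chr01": 90.3, "chr06": 46.0, "chr12": 65.5}
INDEL_COUNT = {"chr01": 16547, "chr06": 10619, "chr12": 9753}


def kb_per_indel(length_mb, n_indels):
    """Average spacing between InDels, in kb per InDel."""
    return length_mb * 1000.0 / n_indels


per_chrom = {c: kb_per_indel(LENGTH_MB[c], INDEL_COUNT[c]) for c in LENGTH_MB}
genome_wide = kb_per_indel(759.8, 145695)
```

Under this assumption the values agree with the reported 5.46, 4.33, and 6.72 kb per InDel for chromosomes 1, 6, and 12, and with the genome-wide 5.22 kb, to within rounding.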

3.2. Number of primers designed and success of PCR amplification

Using the approach described in the section 'Selection of InDels for validation and primer design' of Materials and methods, 3029 candidate InDels were selected and primers were designed for PCR validation (Supplementary Table S1). The average physical distance between two adjacent InDels was 250 kb, ranging from 241 kb (chromosome 2) to 255 kb (chromosome 3) across the 12 chromosomes. PCR results showed that 272 primer pairs could not generate PCR products from the genomic DNA of both Heinz 1706 and LA1589 (Table 2). The PCR success rate was 91.0%, which was consistent with our previous finding of 91.9% for PCR amplification of genomic DNA in tomato. 9 The InDel sizes of the PCR products amplified by most primer pairs (98.5%) were as predicted. However, 23 primer pairs showed smaller and 10 primer pairs showed larger sizes than predicted (Supplementary Table S1). In addition, 485 primer pairs did not show detectable polymorphisms between Heinz 1706 and LA1589 (Table 2). InDels of 6-30 bp had a high rate (83.6%) of polymorphism validation, whereas InDels of <6 bp and >30 bp had validation rates of 78.3 and 43.3%, respectively. In particular, only one of five InDels was validated when the size was >50 bp (Supplementary Table S2). The primer pairs with PCR failure or non-detectable polymorphisms were excluded, and the remaining 2272 primer pairs were used for subsequent analysis. Therefore, the actual average distance between two adjacent InDels was 334 kb, ranging from 285 kb (chromosome 2) to 430 kb (chromosome 1) across the 12 chromosomes.

Table 3. Predicted number and frequency of InDels between Heinz 1706 and LA1589

| Chromosome | No. of predicted InDels: total | No. of predicted InDels: 1 bp | No. of predicted InDels: >1 bp | Frequency (kb/InDel): total | Frequency (kb/InDel): 1 bp | Frequency (kb/InDel): >1 bp |
|---|---|---|---|---|---|---|
| chr01 | 16 547 | 8777 | 7770 | 5.46 | 10.29 | 11.62 |
| chr02 | 10 695 | 5398 | 5297 | 4.67 | 9.24 | 9.42 |
| chr03 | 12 842 | 6779 | 6063 | 5.05 | 9.56 | 10.69 |
| chr04 | 11 495 | 6112 | 5383 | 5.58 | 10.49 | 11.91 |
| chr05 | 12 148 | 6816 | 5332 | 5.35 | 9.54 | 12.19 |
| chr06 | 10 619 | 5540 | 5079 | 4.33 | 8.30 | 9.06 |
| chr07 | 13 426 | 7386 | 6040 | 4.86 | 8.84 | 10.81 |
| chr08 | 13 776 | 7591 | 6185 | 4.57 | 8.30 | 10.19 |
| chr09 | 11 417 | 6251 | 5166 | 5.93 | 10.83 | 13.10 |
| chr10 | 13 390 | 7443 | 5947 | 4.84 | 8.71 | 10.90 |
| chr11 | 9587 | 5221 | 4366 | 5.57 | 10.23 | 12.23 |
| chr12 | 9753 | 5432 | 4321 | 6.72 | 12.06 | 15.16 |
| Total | 145 695 | 78 746 | 66 949 | | | |
| Average | | | | 5.22 | 9.65 | 11.35 |
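The rates quoted in this section can be rechecked from the totals in Table 2. The short calculation below is a sketch under one assumption: that the 82.4% verification rate given in the Abstract is the share of amplified primer pairs that showed the predicted polymorphism. The variable names are illustrative.

```python
designed = 3029          # primer pairs designed (Table 2, total)
no_amplification = 272   # pairs that gave no PCR product
no_polymorphism = 485    # pairs amplifying without detectable polymorphism
genome_mb = 759.8        # sequenced genome length (Table 2, total)

amplified = designed - no_amplification
validated = amplified - no_polymorphism

pcr_success_pct = 100.0 * amplified / designed
validation_pct = 100.0 * validated / amplified

# Average spacing of markers before and after validation, in kb.
spacing_designed_kb = genome_mb * 1000.0 / designed
spacing_validated_kb = genome_mb * 1000.0 / validated
```

With these inputs the PCR success rate comes out at about 91.0%, the validation rate at about 82.4%, and the average marker spacing at roughly 250 kb before and 334 kb after validation, matching the figures in the text.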

The 2272 InDel markers generated 5025 alleles in the whole collection of 22 tomato genotypes. The number of alleles per InDel varied from 2 to 8 with an average of 2.2. Among the polymorphic InDels, most (85.3%) had two alleles, 10.7% had three alleles, and 2.7% had four alleles (Fig. 1). Only three and two markers had seven and eight alleles, respectively. Similarly, 84.9% of polymorphic InDels in S. pimpinellifolium, 94.7% in S. lycopersicum var. cerasiforme, and 95.8% in S. lycopersicum had two alleles (Fig. 1).

3.3. Marker polymorphisms and distribution among three tomato species

Of the 5025 alleles amplified by the 2272 InDel markers, 1930 were shared by all three species. The total number of alleles in each species decreased from 3941 in S. pimpinellifolium to 3431 in S. lycopersicum var. cerasiforme and 3110 in S. lycopersicum (Fig. 2). The number of alleles unique to each species also decreased dramatically, from 1382 in S. pimpinellifolium to 56 in S. lycopersicum var. cerasiforme and 60 in S. lycopersicum. Solanum pimpinellifolium shared more alleles with S. lycopersicum var. cerasiforme than with S. lycopersicum.

Figure 1. Frequency distribution of InDels (≥2 bp) in Solanum pimpinellifolium, S. lycopersicum var. cerasiforme, and S. lycopersicum. [Bar chart not reproduced; x-axis: number of alleles; series: S. pimpinellifolium, S. lycopersicum var. cerasiforme, S. lycopersicum, and all species.]

Figure 2. Venn diagram shows the proportion of common alleles among Solanum pimpinellifolium, S. lycopersicum var. cerasiforme, and S. lycopersicum. This figure appears in colour in the online version of DNA Research. [Diagram not reproduced; labels: S. pimpinellifolium (n = 8), 3941 alleles; S. lycopersicum var. cerasiforme (n = 4), 3431 alleles; S. lycopersicum, 3110 alleles.]

Pairwise comparisons revealed that almost all InDel markers were polymorphic between S. pimpinellifolium and S. lycopersicum var. cerasiforme or S. lycopersicum. However, the proportion of polymorphic InDels dropped to 53.0% between S. lycopersicum var. cerasiforme and S. lycopersicum. Between 0.1 and 20.7% of InDels had alleles alternatively fixed in the paired species. In addition, 18.5-26.9% of InDels had alleles shared by the paired species. The proportion of InDels with alleles specific to one species varied from 6.1 to 44.0% (Fig. 3). The proportion of polymorphic InDels was 61.4-100.0% (average 84.6%) between any accession of S. pimpinellifolium and any genotype of S. lycopersicum, 55.3-93.8% (average 71.5%) between any accession of S. pimpinellifolium and any line of S. lycopersicum var. cerasiforme, and 7.7-33.9% (average 19.2%) between any line of S. lycopersicum var. cerasiforme and any genotype of S. lycopersicum (Supplementary Table S3).

Although the 2272 InDels were almost evenly distributed across all 12 chromosomes (Supplementary Fig. S1), the distribution of polymorphic markers varied among the three species (Supplementary Fig. S2). Solanum pimpinellifolium had a relatively even distribution of polymorphic InDels on all 12 chromosomes. Solanum lycopersicum var. cerasiforme had a distribution pattern of polymorphic InDels similar to that of S. pimpinellifolium on chromosomes 2, 3, 4, 5, 6, 9, 10, and 11, but clusters of polymorphic InDels occurred in some regions on chromosomes 1, 7, and 12. The distribution of polymorphic InDels varied across and within chromosomes in S. lycopersicum. Among the six chromosomes with fewer polymorphic InDels, chromosomes 1, 8, 10, and 12 had relatively even distributions, while the long-arm ends of chromosomes 3 and 7 had more InDels than other regions. There were fewer InDels at one end of chromosomes 2, 4, 5, 9, and 11. However, chromosomes 5, 9, and 11 showed relatively even distribution. On chromosome 6, the short arm had more polymorphic InDels than the long arm.

The proportion of polymorphic InDels on the 12 chromosomes ranged from 45.9 to 81.8% in S. pimpinellifolium, 5.8 to 76.6% in S. lycopersicum var. cerasiforme, and 5.3 to 72.2% in S. lycopersicum (Fig. 4). The numbers of polymorphic InDels decreased considerably on four chromosomes (1, 7, 8, and 12) in S. lycopersicum var. cerasiforme and S. lycopersicum (Table 2). Furthermore, the proportions of polymorphic InDels on chromosomes 3 and 10 were similar in S. pimpinellifolium and S. lycopersicum var. cerasiforme, but significantly decreased in S. lycopersicum (Fig. 4). Interestingly, increases in InDel polymorphism were observed on chromosomes 4, 5, and 11 in S. lycopersicum var. cerasiforme and S. lycopersicum. The proportions of polymorphic InDels also increased on chromosomes 2 and 3 in S. lycopersicum var. cerasiforme.

Figure 3. Pairwise comparisons of allelic variation among Solanum pimpinellifolium, S. lycopersicum var. cerasiforme, and S. lycopersicum. Pie diagrams show the proportion of 2272 InDels that fell into five categories: (1) InDels where a monomorphic allele was shared by all members in the two species; (2) InDels where alleles were found among the members of the two species; (3) InDels where a unique allele was found among members of the first species listed, whereas an alternative allele (found in both groups) was fixed in the second species; (4) InDels where a unique allele was found among members of the second species listed, whereas an alternative allele (found in both species) was fixed in the first species; (5) InDels where the two species were fixed for alternative alleles. [Pie diagrams not reproduced; surviving panel title: S. pimpinellifolium vs S. lycopersicum var. cerasiforme.]

3.4. Marker polymorphisms and genetic variation within three tomato species

The proportion of polymorphic InDels was 61.6% in the 8 S. pimpinellifolium accessions, 45.2% in the 4 S. lycopersicum var. cerasiforme accessions, and 31.6% in the 10 cultivated tomato varieties (Table 2). However, the rate of polymorphic InDels between any two genotypes was low, ranging from 14.3 to 33.6% in S. pimpinellifolium, 17.5 to 31.5% in S. lycopersicum var. cerasiforme, and 1.5 to 19.8% in S. lycopersicum (Supplementary Table S3).



Not surprisingly, the eight accessions of S. pimpinellifolium had the largest genetic variation among the three species. The average genetic distance was 0.216, ranging from 0.178 (PI 128216) to 0.244 (LA1589). Accessions LA1589 and LA2181 had the greatest genetic distance (0.394), whereas accessions PI 128216 and LA0373 had the smallest (0.137). The average genetic distance decreased slightly to 0.202, with a range from 0.162 (LA4133) to 0.237 (PI 114490), in the four S. lycopersicum var. cerasiforme lines, but decreased significantly to 0.108, with a range from 0.086 (Baiguoqiangfeng) to 0.139 (M82), in the 10 varieties of S. lycopersicum. The minimum genetic distance was 0.012 between varieties Liger 87-5 and M82, followed by 0.015 between varieties Baiguoqiangfeng and Zhongshu 5, while the largest genetic distance was 0.214 between Shijifeng and M82.

The dendrogram was constructed from the pairwise genetic distance matrix based on Nei's distance for the 22 genotypes. Two distinct groups, A and B, were obtained (Fig. 5). All 8 accessions of S. pimpinellifolium were in Group A, and the 10 S. lycopersicum cultivars and 4 S. lycopersicum var. cerasiforme accessions were in Group B. The four fresh market cultivars clustered together. However, the five processing varieties, the greenhouse variety, and the four S. lycopersicum var. cerasiforme accessions did not form their own clades. Of the four S. lycopersicum var. cerasiforme lines, LA4133 clustered with three processing varieties and the greenhouse variety, Black cherry clustered with two processing varieties, and PI 114490 and LA1310 stood alone.

3.5. Genes with InDels in the coding region 

A BLAST search of the flanking sequences of the 2272 validated InDels against the tomato ITAG2.3_cds.fasta data identified 56 InDels in coding regions of annotated genes (Supplementary Table S4), of which 64.3% were deletions and 35.7% were insertions in Heinz 1706. Based on InDel size, 28.6% of these InDels were frame-shift mutations, because the number of nucleotides in the InDel was not divisible by three. The remaining 71.4% did not cause a frame shift, but would result in the insertion or deletion of one or more amino acids.
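The frame-shift criterion used above is simply that the InDel length is not a multiple of three. A minimal sketch follows; the function names and the example lengths are made up for illustration, and the actual lengths are in Supplementary Table S4, which is not reproduced here.

```python
def is_frameshift(indel_length):
    """True when an InDel of this length (bp) shifts the reading frame."""
    if indel_length <= 0:
        raise ValueError("InDel length must be positive")
    return indel_length % 3 != 0


def frameshift_fraction(lengths):
    """Fraction of InDels in `lengths` that cause a frame shift."""
    lengths = list(lengths)
    return sum(is_frameshift(n) for n in lengths) / len(lengths)


# Hypothetical InDel lengths in bp, not taken from the paper.
example_fraction = frameshift_fraction([2, 3, 6, 7])
```

Applied to the 56 coding-region InDels, the reported 28.6% corresponds to 16 InDels whose lengths are not divisible by three.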

4. Discussion 

Molecular markers are important for genetic studies and marker-assisted selection. Large-scale SNP discovery combined with high-throughput genotyping has shown its power in gene identification and breeding in tomato. 12 However, high costs and technical or equipment demands remain a major obstacle to the large-scale use of SNPs in developing countries. 41,42 In contrast, genotyping of short InDels is relatively easy and inexpensive, requiring only simple PCR and electrophoresis. Short InDels can also be analysed with high-throughput technologies 26,43,44 and in large-scale multiplexing. 45 As genetic markers, InDels have been successfully used for forensic analysis 46-48 and individual identification 44,45 in humans, as well as for genetic studies in several plant species including rice, wheat, citrus, and Arabidopsis. 33 Although the tomato genome sequences have been widely used since their release for various purposes, including SNP discovery, genetic mapping, gene prediction, gene expression, genetic diversity, comparative genomics, and epigenetics, 49 identification of InDels has so far been confined to detecting polymorphisms between wild species and cultivated tomato. 34,35 In this study, we identified InDels by comparative analysis of genome sequences between S. pimpinellifolium and S. lycopersicum, and then validated them in 10 cultivated tomato lines via PCR amplification. Of the 2272 InDels polymorphic between LA1589 and Heinz 1706, 31.6% were polymorphic among the 10 cultivated tomato varieties and 1.5-19.8% were polymorphic between any 2 of the 10 cultivated tomato varieties. Based on the total number of InDels (145 695) between LA1589 and Heinz 1706, we estimated that there were 2100-28 800 InDels between any two cultivated tomato varieties, suggesting that there are abundant InDels for genetic studies and marker-assisted selection in cultivated tomato.

Figure 4. Distribution of the proportion of polymorphic InDels on 12 chromosomes in Solanum pimpinellifolium, S. lycopersicum var. cerasiforme, and S. lycopersicum.

Figure 5. The dendrogram of 22 tomato genotypes based on 2272 InDel marker data, and generated from Nei's genetic distance matrix by UPGMA in PHYLIP 3.695.
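The estimate of 2100-28 800 InDels between any two cultivated varieties appears to follow from scaling the genome-wide total by the pairwise polymorphism rates. The sketch below makes that arithmetic explicit; the rounding rule (down to the nearest hundred) is an assumption chosen because it reproduces the quoted bounds, and is not stated in the text.

```python
import math

total_indels = 145_695       # predicted InDels between LA1589 and Heinz 1706
pairwise_rate_min = 0.015    # lowest pairwise polymorphism rate (1.5%)
pairwise_rate_max = 0.198    # highest pairwise polymorphism rate (19.8%)


def floor_to_hundred(value):
    """Round down to the nearest hundred (assumed rounding rule)."""
    return int(math.floor(value / 100.0) * 100)


low_estimate = total_indels * pairwise_rate_min    # about 2185
high_estimate = total_indels * pairwise_rate_max   # about 28 848
```

Note that this extrapolation treats the 2272 validated markers, all of which were 2 bp or longer, as representative of all 145 695 predicted InDels, more than half of which were 1 bp.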

Precise identification of InDels in sequence databases depends on the strategy and parameters used for data mining as well as on the quality of the sequence data. Since InDels are the dominant error type generated by 454 pyrosequencing 50 and an InDel error rate of one per 6.4 kb was observed in tomato, 34 the initial work on identification of InDels between the genomes of LA1589 and Heinz 1706 did not count InDels of 1 and 2 bp, to avoid overestimating small InDels due to sequencing errors. 34 Using a bioinformatic pipeline involving various comparative genomics tools, 9474 InDels of 15-100 bp were identified between LA1589 and Heinz 1706, and >80% could be verified by PCR (Jiang et al. unpublished data, acquired from ftp://ftp.solgenomics.net/maps_and_markers/LippmanZ/, 19 February 2014, date last accessed). In this study, a total of 145 695 InDels were predicted between LA1589 and Heinz 1706, approximately one-fifth of the 749 966 InDels identified in Sato et al. 34 The overall frequency of InDels (one per 5.22 kb) was also much lower than the one per 110 bp in Sato et al. 34 However, the number (9137) of InDels of 15-94 bp was close to the result of Jiang et al., although the strategies used for InDel identification were different. Owing to the lack of a methodology description in Sato et al., 34 we were not able to determine the cause of the difference between the two studies. Two points are worth noting. First, the lengths of putative InDels identified in the two studies differed, with ranges of 3-300 bp in Sato et al. 34 and 1-94 bp in this study. We could not identify any InDels >94 bp using our methodology. Secondly, the rate of validation (82.4%) was close to the 81.7% obtained in Koenig et al., 35 although the comparisons involved different wild species and cultivated varieties, indicating that ~20% of predicted InDels (≥2 bp) were false due to sequencing errors. Together, these observations suggest that our prediction might be closer to the real number of InDels in the currently available genome sequences of LA1589 and Heinz 1706.

The polymorphic InDels were evenly distributed across all 12 chromosomes in S. pimpinellifolium, but appeared non-randomly distributed across and within chromosomes in S. lycopersicum var. cerasiforme and S. lycopersicum. Domestication and selection could be one cause of this difference. For example, there were 38 and 35 polymorphic InDels at the bottom (~11 Mb) of chromosome 2 in S. pimpinellifolium and S. lycopersicum var. cerasiforme, respectively, but only two InDels were polymorphic in S. lycopersicum. This might be due to the existence of quantitative trait loci for fruit weight and selection for large fruit in S. lycopersicum.^ 2 In addition, several studies have shown that the introgression of disease resistance genes into many cultivars has a strong influence on SNP patterns. 19,51 Such introgression could also contribute to the differences in the distribution of polymorphic InDels among the three species.

It has been suggested that domestication and inbreeding dramatically reduced genetic variation 52 and that modern cultivars have less genetic variation than old ones in tomato. 53,54 In this study, the genetic variation of three species was investigated using the same large set of InDel markers, which allowed us to compare genetic polymorphisms among and within species at the same time. The number of polymorphic InDels, the total number of alleles amplified by InDel markers, and the average genetic distance in the 10 S. lycopersicum varieties were significantly reduced compared with those in the 8 S. pimpinellifolium accessions, supporting a reduction of genetic variation in cultivated tomato. The four S. lycopersicum var. cerasiforme accessions showed an intermediate amount of genetic diversity between S. lycopersicum and S. pimpinellifolium, which was consistent with previous findings. 55,56 However, some novel alleles occurred in both S. lycopersicum var. cerasiforme and S. lycopersicum, suggesting that domestication and selection could also generate new variation.

The occurrence of InDels in the coding region of a gene can cause either a frame shift or amino acid InDels, which most likely alter gene function and result in phenotypic change. 57 A Rider insertion in the first exon of the Psy1 gene causes early termination of Psy1 transcription, which results in yellow flesh in the tomato r mutant. 58 A single-base deletion in the coding region of the SlIAA9 gene, an Aux/IAA gene involved in tomato leaf morphology, converts tomato compound leaves to simple leaves. 59 InDels occurring in the promoter region can also affect gene expression. 60 Here, we identified 145 695 InDels between LA1589 and Heinz 1706, and 31.6% of the validated InDels were polymorphic in cultivated tomatoes. The percentage of InDels (2.5%) occurring in coding regions of genes identified in this study was much lower than that in our recent work (19.7%) on comparative analysis of resistance-like genes between LA1589 and Heinz 1706. 61 The difference is probably because our previous work targeted specific genes, whereas this study examined a random sample of InDels.

In conclusion, there are abundant short InDels in cultivated tomato. Identification and validation of such short InDels will not only provide molecular markers for genetic studies and marker-assisted selection in breeding, but also provide useful information for gene cloning and functional analysis.